Abstract

The Random Projection Tree (RPTree) structures proposed in [1] are space partitioning data
structures that automatically adapt to various notions of intrinsic dimensionality of data. We
prove new results for both the RPTree-Max and the RPTree-Mean data structures. Our result for
RPTree-Max gives a near-optimal bound on the number of levels required by this data structure to
reduce the size of its cells by a factor $s \ge 2$. We also prove a packing lemma for this data
structure. Our final result shows that low-dimensional manifolds have bounded Local Covariance
Dimension. As a consequence we show that RPTree-Mean adapts to manifold dimension as well.

1 Introduction 

The Curse of Dimensionality [2] has inspired research in several directions in Computer Science and
has led to the development of several novel techniques such as dimensionality reduction and sketching.
Almost all these techniques try to map data to lower dimensional spaces while approximately
preserving useful information. However, most of these techniques do not assume anything about the
data other than that they are embedded in some high dimensional Euclidean space endowed with
some distance/similarity function.

As it turns out, in many situations, the data is not simply scattered in the Euclidean space in a random 
fashion. Often, generative processes impose (non-linear) dependencies on the data that restrict the 
degrees of freedom available and result in the data having low intrinsic dimensionality. There exist 
several formalizations of this concept of intrinsic dimensionality. [1] provides an excellent example
of automated motion capture in which a large number of points on the body of an actor are sampled 
through markers and their coordinates transferred to an animated avatar. Now, although a large 
sample of points is required to ensure a faithful recovery of all the motions of the body (which 
causes each captured frame to lie in a very high dimensional space), these points are nevertheless 
constrained by the degrees of freedom offered by the human body which are very few. 

Algorithms that try to exploit such non-linear structure in data have been studied extensively,
resulting in a large number of Manifold Learning algorithms, for example [3], [4], [5]. These techniques
typically assume knowledge about the manifold itself or the data distribution. For example, [4] and
[5] require knowledge about the intrinsic dimensionality of the manifold. [3] requires a sampling of
points that is "sufficiently" dense with respect to some manifold parameters.

Recently, in [1], Dasgupta and Freund proposed space partitioning algorithms that adapt to the
intrinsic dimensionality of data and do not assume explicit knowledge of this parameter. Their data
structures are akin to the k-d tree structure and offer guaranteed reduction in the size of the cells
after a bounded number of levels. Such a size reduction is of immense use in vector quantization
[6] and regression [7]. [1] presents two such tree structures, each adapting to a different notion of
intrinsic dimensionality. Both variants have already found numerous applications in regression [7],
spectral clustering [8], face recognition [9] and image super-resolution [10].

1.1 Contributions 

The RPTree structures are new entrants in a large family of space partitioning data structures such
as k-d trees [11], BBD trees [12], BAR trees [13] and several others (see [14] for an overview). The
typical guarantees given by these data structures are of the following types :

1. Space Partitioning Guarantee : There exists a bound $L(s)$, $s \ge 2$, on the number of levels
one has to go down before all descendants of a node of size $\Delta$ are of size $\Delta/s$ or less. The
size of a cell is variously defined as the length of the longest side of the cell (for box-shaped
cells), radius of the cell, etc.

2. Bounded Aspect Ratio : There exists a certain "roundedness" to the cells of the tree - this 
notion is variously defined as the ratio of the length of the longest to the shortest side of the 
cell (for box-shaped cells), the ratio of the radius of the smallest circumscribing ball of the 
cell to that of the largest ball that can be inscribed in the cell, etc. 

3. Packing Guarantee : Given a fixed ball B of radius R and a size parameter r, there exists a 
bound on the number of disjoint cells of the tree that are of size greater than r and intersect 
B. Such bounds are usually arrived at by first proving a bound on the aspect ratio for cells 
of the tree. 

These guarantees play a crucial role in algorithms for fast approximate nearest neighbor searches
[12] and clustering [15]. We present new results for the RPTree-Max structure for all these types
of guarantees. We first present a bound on the number of levels required for size reduction by any
given factor in an RPTree-Max. Our result improves the bound obtainable from results presented
in [1]. Next, we prove an "effective" aspect ratio bound for RPTree-Max. Given the randomized
nature of the data structure it is difficult to directly bound the aspect ratios of all the cells. Instead 
we prove a weaker result that can nevertheless be exploited to give a packing lemma of the kind 
mentioned above. More specifically, given a ball B, we prove an aspect ratio bound for the smallest 
cell in the RPTree-Max that completely contains B. 

Our final result concerns the RPTree-Mean data structure. The authors in [1] prove that this
structure adapts to the Local Covariance Dimension of data (see Section 5 for a definition). By
showing that low-dimensional manifolds have bounded local covariance dimension, we show its
adaptability to the manifold dimension as well. Our result demonstrates the robustness of the notion
of manifold dimension - a notion that is able to connect to a geometric notion of dimensionality such
as the doubling dimension (proved in [1]) as well as a statistical notion such as Local Covariance
Dimension (this paper).

1.2 Organization of the paper 

In Section 2 we present a brief introduction to the RPTree-Max data structure and discuss its
analysis. In Section 3 we present our generalized size reduction lemma for the RPTree-Max. In
Section 4 we give an effective aspect ratio bound for the RPTree-Max which we then use to arrive
at our packing lemma. In Section 5 we show that the RPTree-Mean adapts to manifold dimension.

All results cited from other papers are presented as Facts in this paper. We will denote by $B(x, r)$
a closed ball of radius $r$ centered at $x$. We will denote by $d$ the intrinsic dimensionality of data and
by $D$ the ambient dimensionality (typically $d \ll D$).

2 The RPTree-Max structure 

The RPTree-Max structure adapts to the doubling dimension of data (see definition below). Since
low-dimensional manifolds have low doubling dimension (see [1], Theorem 22), the structure
adapts to manifold dimension as well.

Definition 1 (taken from [16]). The doubling dimension of a set $S \subset \mathbb{R}^D$ is the smallest integer $d$
such that for any ball $B(x, r) \subset \mathbb{R}^D$, the set $B(x, r) \cap S$ can be covered by $2^d$ balls of radius $r/2$.

The RPTree-Max algorithm is presented with data embedded in $\mathbb{R}^D$ having doubling dimension $d$. The
algorithm splits data lying in a cell $C$ of radius $\Delta$ by first choosing a random direction $v \in \mathbb{R}^D$,
projecting all the data inside $C$ onto that direction, choosing a random value $\delta$ in the range
$[-1, 1] \cdot 6\Delta/\sqrt{D}$ and then assigning a data point $x$ to the left child if
$x \cdot v < \text{median}(\{z \cdot v : z \in C\}) + \delta$ and the right child otherwise. Since it is difficult to get
the exact value of the radius of a data set, the algorithm settles for a constant factor approximation
to the value by choosing an arbitrary data point $x \in C$ and using the estimate
$\tilde{\Delta} = \max(\{\|x - y\| : y \in C\})$.
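The split rule just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the reference implementation from [1]: the function name `rptree_max_split` is hypothetical, the random direction is drawn with i.i.d. $N(0, 1/D)$ coordinates (an assumption that matches the distribution used in the facts quoted later), and a cell is represented as a plain list of tuples.

```python
import math
import random
import statistics

def rptree_max_split(points, rng):
    """Split one cell (a list of equal-length tuples) into two children."""
    D = len(points[0])
    # Random direction with i.i.d. N(0, 1/D) coordinates (assumed here).
    v = [rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
    # Radius estimate: largest distance from one arbitrary point of the cell.
    x = rng.choice(points)
    radius_est = max(math.dist(x, y) for y in points)
    # Offset drawn uniformly from [-1, 1] * 6 * radius_est / sqrt(D).
    offset = rng.uniform(-1.0, 1.0) * 6.0 * radius_est / math.sqrt(D)
    # Project every point onto v and compare against median + offset.
    proj = [sum(a * b for a, b in zip(p, v)) for p in points]
    threshold = statistics.median(proj) + offset
    left = [p for p, t in zip(points, proj) if t < threshold]
    right = [p for p, t in zip(points, proj) if t >= threshold]
    return left, right

rng = random.Random(0)
cell = [tuple(rng.gauss(0.0, 1.0) for _ in range(10)) for _ in range(200)]
left, right = rptree_max_split(cell, rng)
print(len(left), len(right))
```

Because of the random offset, the two children are generally not balanced; this is the price paid for the geometric guarantees discussed in the rest of the section.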

The following result is proven in [1] :

Fact 2 (Theorem 3 in [1]). There is a constant $c_1$ with the following property. Suppose an
RPTree-Max is built using a data set $S \subset \mathbb{R}^D$. Pick any cell $C$ in the RPTree-Max; suppose that
$S \cap C$ has doubling dimension $\le d$. Then with probability at least $1/2$ (over the randomization in
constructing the subtree rooted at $C$), every descendant $C'$ more than $c_1 d \log d$ levels below $C$ has
radius$(C') \le$ radius$(C)/2$.

In Sections 2, 3 and 4 we shall always assume that the data has doubling dimension $d$ and shall
not state this fact explicitly each time. Let us consider extensions of this result to bound the
number of levels it takes for the size of all descendants to go down by a factor $s > 2$. Let us analyze
the case of $s = 4$. Starting off in a cell $C$ of radius $\Delta$, we are assured of a reduction in size by a
factor of 2 after $c_1 d \log d$ levels. Hence all $2^{c_1 d \log d}$ nodes at this level have radius $\Delta/2$ or less. Now
we expect that after $c_1 d \log d$ more levels, the size should go down further by a factor of 2, thereby
giving us our desired result. However, given the large number of nodes at this level and the fact
that the success probability in Fact 2 is just greater than a constant bounded away from 1, it is not
possible to argue that after $c_1 d \log d$ more levels the descendants of all these $2^{c_1 d \log d}$ nodes will be
of radius $\Delta/4$ or less. It turns out that this can be remedied by utilizing the following extension of
the basic size reduction result in [1]. We omit the proof of this extension.

Fact 3 (Extension of Theorem 3 in [1]). For any $\delta > 0$, with probability at least $1 - \delta$, every
descendant $C'$ which is more than $c_1 d \log d + \log(1/\delta)$ levels below $C$ has radius$(C') \le$ radius$(C)/2$.

This gives us a way to boost the confidence and do the following : go down $L = c_1 d \log d + 2$ levels
from $C$ to get the radius of all the $2^{c_1 d \log d + 2}$ descendants down to $\Delta/2$ with confidence $1 - 1/4$.
Afterward, go an additional $L' = c_1 d \log d + L + 2$ levels from each of these descendants so that
for any cell at level $L$, the probability of it having a descendant of radius $> \Delta/4$ after $L'$ levels is
less than $\frac{1}{4 \cdot 2^L}$. Hence conclude with confidence at least $1 - \frac{1}{4} - \frac{1}{4 \cdot 2^L} \cdot 2^L \ge \frac{1}{2}$ that all
descendants of $C$ after $2L + c_1 d \log d + 2$ levels have radius $\le \Delta/4$. This gives a way to prove the
following result :

Theorem 4. There is a constant $c_2$ with the following property. For any $s \ge 2$, with probability at
least $1 - 1/4$, every descendant $C'$ which is more than $c_2 \cdot s \cdot d \log d$ levels below $C$ has
radius$(C') \le$ radius$(C)/s$.

Proof. Without loss of generality assume that $s$ is a power of 2. We will prove the result by
induction. Fact 3 proves the base case for $s = 2$. For the induction step, let $L(s)$ denote the number of
levels it takes to reduce the size by a factor of $s$ with high confidence. Then we have

\[ L(s) \le L(s/2) + c_1 d \log d + L(s/2) + 2 = 2L(s/2) + c_1 d \log d + 2. \]

Solving the recurrence gives $L(s) = O(s\, d \log d)$. □
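The recurrence can be unrolled explicitly. The following worked derivation is a sketch under two simplifying assumptions: the recurrence is treated as an equality, and the base case is taken to be $L(2) = c_1 d \log d + 2$, which is what Fact 3 gives for $\delta = 1/4$. The shorthand $a = c_1 d \log d + 2$ and the exponent $k$ (with $s = 2^k$) are introduced here only for the calculation.

```latex
L(2^k) + a \;=\; 2\bigl(L(2^{k-1}) + a\bigr) \;=\; \dots \;=\; 2^{k-1}\bigl(L(2) + a\bigr) \;=\; 2^{k} a
\quad\Longrightarrow\quad
L(s) \;=\; (s-1)\,a \;=\; (s-1)\,(c_1 d \log d + 2) \;=\; O(s\, d \log d).
```

The doubling of $L(s/2)$ at each step is what makes the dependence on $s$ linear rather than logarithmic.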

Notice that the dependence on the factor $s$ is linear in the above result whereas one expects it to
be logarithmic. Indeed, typical space partitioning algorithms such as k-d trees do give such
guarantees. The first result we prove in the next section is a bound on the number of levels that is
poly-logarithmic in the size reduction factor $s$.

3 A generalized size reduction lemma for RPTree-Max 

In this section we prove the following theorem : 

Theorem 5 (Main). There is a constant $c_3$ with the following property. Suppose an RPTree-Max
is built using data set $S \subset \mathbb{R}^D$. Pick any cell $C$ in the RPTree-Max; suppose that $S \cap C$
has doubling dimension $\le d$. Then for any $s \ge 2$, with probability at least $1 - 1/4$ (over the
randomization in constructing the subtree rooted at $C$), for every descendant $C'$ which is more than
$c_3 \cdot \log s \cdot d \log sd$ levels below $C$, we have radius$(C') \le$ radius$(C)/s$.

Compared to this, data structures such as [12] give deterministic guarantees for such a reduction in
$D \log s$ levels which can be shown to be optimal (see [1] for an example). Thus our result is optimal
but for a logarithmic factor. Moving on with the proof, let us consider a cell $C$ of radius $\Delta$ in the
RPTree-Max that contains a dataset $S$ having doubling dimension $\le d$. Then for any $\epsilon > 0$, a
repeated application of Definition 1 shows that $S$ can be covered using at most $2^{d \log(1/\epsilon)}$ balls
of radius $\epsilon\Delta$. We will cover $S \cap C$ using balls of radius $\frac{\Delta}{960 s\sqrt{d}}$ so that $O((sd)^d)$ balls would
suffice. Now consider all pairs of these balls, the distance between whose centers is
$\ge \frac{\Delta}{s} - \frac{\Delta}{960 s\sqrt{d}}$. If random splits separate data from all such pairs of balls, i.e. for no pair does
any cell contain data from both balls of the pair, then each resulting cell would only contain data
from pairs whose centers are closer than $\frac{\Delta}{s} - \frac{\Delta}{960 s\sqrt{d}}$. Thus the radius of each such cell would be
at most $\Delta/s$.

We fix such a pair of balls calling them $B_1$ and $B_2$. A split in the RPTree-Max is said to be good
with respect to this pair if it sends points inside $B_1$ to one child of the cell in the RPTree-Max
and points inside $B_2$ to the other, bad if it sends points from both balls to both children and neutral
otherwise (see Figure 1). We have the following properties of a random split :

Lemma 6. Let $B = B(x, \delta)$ be a ball contained inside an RPTree-Max cell of radius $\Delta$ that
contains a dataset $S$ of doubling dimension $d$. Let us say that a random split splits this ball if the
split separates the data set $S \cap B$ into two parts. Then a random split of the cell splits $B$ with
probability at most $\frac{3\delta\sqrt{d}}{\Delta}$.

Proof. The RPTree-Max splits proceed by randomly projecting the data in a cell onto the real
line and then choosing a split point in an interval of length $12\Delta/\sqrt{D}$. It is important to note that
the random direction and the split point are chosen independently. Hence, suppose data inside the
ball $B$ gets projected onto an interval $\tilde{B}$ of radius $r$, then the probability of it getting split is at most
$r\sqrt{D}/6\Delta$ since the split point is chosen randomly in an interval of length $12\Delta/\sqrt{D}$ independently
of the projection. Let $R_{\tilde{B}}$ be the random variable that gives the radius of the interval $\tilde{B}$. Hence the
probability of $B$ getting split is at most

\[
\int_0^\infty \frac{r\sqrt{D}}{6\Delta} \Pr[R_{\tilde{B}} = r]\, dr
= \frac{\sqrt{D}}{6\Delta} \int_0^\infty \int_0^r \Pr[R_{\tilde{B}} = r]\, dt\, dr
= \frac{\sqrt{D}}{6\Delta} \int_0^\infty \int_t^\infty \Pr[R_{\tilde{B}} = r]\, dr\, dt
= \frac{\sqrt{D}}{6\Delta} \int_0^\infty \Pr[R_{\tilde{B}} \ge t]\, dt.
\]
We have the following result from [1] :

Fact 7 (Lemma 6 of [1]).
\[ \Pr\left[ R_{\tilde{B}} \ge \frac{4\delta}{\sqrt{D}} \sqrt{2\left(d + \ln\frac{2}{\eta}\right)} \right] \le \eta. \]



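Fix the value $l = \frac{4\delta}{\sqrt{D}}\sqrt{2(d + \ln 2)}$. Using the fact that for any $t$, $\Pr[R_{\tilde{B}} \ge t] \le 1$, and making the
change of variables $t = \frac{4\delta}{\sqrt{D}}\sqrt{2\left(d + \ln\frac{2}{\eta}\right)}$, we get

\[
\int_0^\infty \Pr[R_{\tilde{B}} \ge t]\, dt
= \int_0^l \Pr[R_{\tilde{B}} \ge t]\, dt + \int_l^\infty \Pr[R_{\tilde{B}} \ge t]\, dt
\le \int_0^l 1\, dt + \int_1^0 \eta\, dt(\eta).
\]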

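Simplifying the above expression, we get the split probability to be at most

\[
\frac{2\delta}{3\Delta} \left[ \sqrt{2(d + \ln 2)} + \int_0^1 \frac{d\eta}{\sqrt{2\left(d + \ln\frac{2}{\eta}\right)}} \right]
= \frac{2\delta}{3\Delta} \left[ \sqrt{2(d + \ln 2)} + 2\sqrt{2}\, e^{d} \int_{\sqrt{d + \ln 2}}^{\infty} e^{-x^2}\, dx \right].
\]

Now $\int_a^\infty e^{-x^2} dx = \int_0^\infty e^{-x^2} dx - \int_0^a e^{-x^2} dx \le \frac{\sqrt{\pi}}{2}\left(1 - \sqrt{1 - e^{-a^2}}\right) \le \frac{\sqrt{\pi}}{2} e^{-a^2}$ since
$1 - \sqrt{1 - x} \le x$ for $0 \le x \le 1$. Using $d \ge 1$, we get the probability of the ball $B$ getting split to be
at most

\[
\frac{2\delta}{3\Delta} \left[ \sqrt{2(d + \ln 2)} + \sqrt{\frac{\pi}{2}} \right] \le \frac{3\delta\sqrt{d}}{\Delta}.
\]
□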
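The final inequality in the proof above only needs $d \ge 1$, and it can be sanity-checked numerically. The sketch below assumes that, after applying the Gaussian tail bound with $a^2 = d + \ln 2$, the bracketed term reduces to $\sqrt{2(d + \ln 2)} + \sqrt{\pi/2}$; the common factor $\delta/\Delta$ is divided out, and the helper name `split_bound_ratio` is hypothetical.

```python
import math

def split_bound_ratio(d):
    """Ratio of (2/3) * (sqrt(2(d + ln 2)) + sqrt(pi/2)) to 3 * sqrt(d)."""
    lhs = (2.0 / 3.0) * (math.sqrt(2.0 * (d + math.log(2.0))) + math.sqrt(math.pi / 2.0))
    return lhs / (3.0 * math.sqrt(d))

# The claimed bound holds for a given d exactly when the ratio is at most 1.
ratios = [split_bound_ratio(d) for d in range(1, 1001)]
print(max(ratios))
```

The ratio is largest at $d = 1$ and decreases as $d$ grows, so the constant 3 in the lemma leaves some slack.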



Lemma 8. Let $B_1$ and $B_2$ be a pair of balls as described above contained in the cell $C$ that contains
data of doubling dimension $d$. Then a random split of the cell is a good split with respect to this pair
with probability at least $\frac{1}{56s}$.

Proof. The techniques used in the proof of this lemma are the same as those used to prove a similar
result in [1]. We give a proof sketch here for completeness. We use the following two results
from [1] :

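Fact 9 (Lemma 5 of [1]). Fix any $x \in \mathbb{R}^D$. Pick a random vector $U \sim \mathcal{N}(0, (1/D) I_D)$. Then for
any $\alpha, \beta > 0$ :

1. $\Pr\left[ |U \cdot x| \le \alpha \cdot \frac{\|x\|}{\sqrt{D}} \right] \le \sqrt{\frac{2}{\pi}}\, \alpha$

2. $\Pr\left[ |U \cdot x| \ge \beta \cdot \frac{\|x\|}{\sqrt{D}} \right] \le \frac{2}{\beta}\, e^{-\beta^2/2}$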
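Both tail bounds are easy to probe empirically. The following Monte Carlo sketch assumes the two inequalities read $\Pr[|U \cdot x| \le \alpha \|x\|/\sqrt{D}] \le \sqrt{2/\pi}\,\alpha$ and $\Pr[|U \cdot x| \ge \beta \|x\|/\sqrt{D}] \le (2/\beta) e^{-\beta^2/2}$; the function name, the test vector and the choices $\alpha = 0.5$, $\beta = 2$ are illustrative only.

```python
import math
import random

def projection_tail_frequencies(x, alpha, beta, trials, rng):
    """Empirical frequencies of the two events bounded by the tail inequalities."""
    D = len(x)
    norm = math.sqrt(sum(c * c for c in x))
    small = large = 0
    for _ in range(trials):
        # U has i.i.d. N(0, 1/D) coordinates, i.e. U ~ N(0, (1/D) I_D).
        u_dot_x = sum(rng.gauss(0.0, 1.0 / math.sqrt(D)) * c for c in x)
        # Rescale so the thresholds become simply alpha and beta.
        scaled = abs(u_dot_x) * math.sqrt(D) / norm
        small += scaled <= alpha
        large += scaled >= beta
    return small / trials, large / trials

rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(20)]
p_small, p_large = projection_tail_frequencies(x, 0.5, 2.0, 20000, rng)
print(p_small, math.sqrt(2.0 / math.pi) * 0.5)   # frequency vs. first bound
print(p_large, (2.0 / 2.0) * math.exp(-2.0))     # frequency vs. second bound
```

The first bound is close to tight for small $\alpha$, while the second is loose for moderate $\beta$; both facts are only used with fixed constants in the proof sketch that follows.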



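Fact 10 (Corollary 8 of [1]). Suppose $S \subset \mathbb{R}^D$ lies within the ball $B(x, \Delta)$. Pick any $0 < \delta \le 2/e^2$.
Let this set be projected randomly onto the real line. Let us denote by $\tilde{x}$ the projection of $x$ and by
$\tilde{S}$ the projection of the set $S$. Then with probability at least $1 - \delta$ over the choice of random
projection onto $\mathbb{R}$,
\[ |\text{median}(\tilde{S}) - \tilde{x}| \le \frac{\Delta}{\sqrt{D}} \sqrt{2 \ln\frac{2}{\delta}}. \]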



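Projections of points, sets etc. are denoted with a tilde (~) sign. Applying Fact 7 with $\eta = \frac{2}{e^{31}}$, we
get that with probability $> 1 - \frac{2}{e^{31}}$, the ball $B_1$ gets projected to an interval of radius at most
$\frac{\Delta}{30 s\sqrt{D}}$ centered at $\tilde{x}_1$. The same holds for $B_2$. Applying Fact 9 with $\alpha = \frac{480}{959}$ gives us
$|\tilde{x}_1 - \tilde{x}_2| \ge \frac{\Delta}{2s\sqrt{D}}$ with probability at least $1 - \sqrt{\frac{2}{\pi}} \cdot \frac{480}{959}$. Furthermore, an application of
Fact 9 with $\beta = \sqrt{2 \ln 40}$ shows that with probability at least $1 - \frac{1}{20\sqrt{2 \ln 40}}$,
$|\tilde{x}_1 - \tilde{x}| \le \frac{3\Delta}{\sqrt{D}}$. The same holds true for $\tilde{x}_2$ as well. Finally an application of Fact 10 with
$\delta = \frac{1}{20}$ shows that the median of the projected set $\tilde{S}$ will lie within a distance $\frac{3\Delta}{\sqrt{D}}$ of $\tilde{x}$ (i.e. the
projection of the center of the cell) with probability at least $1 - \frac{1}{20}$.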

Simple calculations show that the preceding guarantees imply that with probability at least $\frac{1}{2}$ over the
choice of random projections, the projections of both the balls will lie within the interval from which
a split point would be chosen. Furthermore there would be a gap of at least
$\frac{\Delta}{2s\sqrt{D}} - 2 \cdot \frac{\Delta}{30 s\sqrt{D}}$ between the projections of the two balls. Hence, given that these good events
take place, with probability at least $\frac{\sqrt{D}}{12\Delta}\left(\frac{\Delta}{2s\sqrt{D}} - 2 \cdot \frac{\Delta}{30 s\sqrt{D}}\right)$ over the choice of the split point,
the balls will get cleanly separated. Note that this uses independence of the choice of projection and
the choice of the split point. Thus the probability of a good split is at least $\frac{1}{56s}$. □

Lemma 11. Let $B_1$ and $B_2$ be a pair of balls as described above contained in the cell $C$ that
contains data of doubling dimension $d$. Then a random split of the cell is a bad split with respect to
this pair with probability at most $\frac{1}{320s}$.

Proof. The proof of a similar result in [1] uses a conditional probability argument. However the
technique does not work here since we require a bound that is inversely proportional to $s$. We instead
make a simple observation that the probability of a bad split is upper bounded by the probability that
one of the balls is split since for any two events $A$ and $B$, $\Pr[A \cap B] \le \min\{\Pr[A], \Pr[B]\}$. The
result then follows from an application of Lemma 6. □

We are now in a position to prove Theorem 5. What we will prove is that starting with a pair of balls
in a cell $C$, the probability that some cell $k$ levels below has data from both the balls is exponentially
small in $k$. Thus, after going down enough levels we can take a union bound over all pairs of
balls whose centers are well separated (which are $O((sd)^{2d})$ in number) and conclude the proof.

Proof (of Theorem 5). Consider a cell $C$ of radius $\Delta$ in the RPTree-Max and fix a pair of balls
contained inside $C$ with radii $\Delta/960 s\sqrt{d}$ and centers separated by at least $\Delta/s - \Delta/960 s\sqrt{d}$. Let
$p^i_j$ denote the probability that a cell $i$ levels below $C$ has a descendant $j$ levels below itself that
contains data points from both the balls. Then the following holds :

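Lemma 12. $p^0_k \le \left(1 - \frac{1}{68s}\right) p^1_{k-1}$.

Proof. We have the following expression for $p^0_k$ :

\[
\begin{aligned}
p^0_k &\le \Pr[\text{split at level 0 is a good split}] \cdot 0 \\
&\quad + \Pr[\text{split at level 0 is a bad split}] \cdot 2p^1_{k-1} \\
&\quad + \Pr[\text{split at level 0 is a neutral split}] \cdot p^1_{k-1} \\
&\le \frac{1}{320s} \cdot 2p^1_{k-1} + \left(1 - \frac{1}{320s} - \frac{1}{56s}\right) p^1_{k-1} \\
&\le \left(1 - \frac{1}{68s}\right) p^1_{k-1}.
\end{aligned}
\]

Similarly, $p^i_j \le \left(1 - \frac{1}{68s}\right) p^{i+1}_{j-1}$ at every level $i$. □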



Note that this gives us $p^0_k \le \left(1 - \frac{1}{68s}\right)^k$ as a corollary. However using this result would require us
to go down $k = \Omega(sd \log(sd))$ levels before $p^0_k$ drops to $\frac{1}{\Omega((sd)^{2d})}$, which results in a bound that is worse
(by a factor logarithmic in $s$) than the one given by Theorem 4. This can be attributed to the small
probability of a good split for a tiny pair of balls in large cells. However, here we are completely
neglecting the fact that as we go down the levels, the radii of cells go down as well and good splits
become more frequent.

Indeed setting $s = 2$ in Lemmas 8 and 11 tells us that if the pair of balls were to be contained in a
cell of radius $\frac{\Delta}{s/2}$ then the good and bad split probabilities are $\frac{1}{112}$ and $\frac{1}{640}$ respectively. This paves
the way for an inductive argument : assume that with probability $> 1 - 1/4$, in $L(s)$ levels, the size of
all descendants goes down by a factor $s$. Denote by $p^l_g$ the probability of a good split in a cell at depth
$l$ and by $p^l_b$ the corresponding probability of a bad split. Set $l^* = L(s/2)$ and let $E$ be the event that
the radius of every cell at level $l^*$ is less than $\frac{\Delta}{s/2}$. Let $C'$ represent a cell at depth $l^*$. Then,

\[
\begin{aligned}
p^{l^*}_g &\ge \Pr[\text{good split in } C' \mid E] \cdot \Pr[E] \ge \frac{1}{112} \cdot \left(1 - \frac{1}{4}\right) \ge \frac{1}{150}, \\
p^{l^*}_b &= \Pr[\text{bad split in } C' \mid E] \cdot \Pr[E] + \Pr[\text{bad split in } C' \mid \neg E] \cdot \Pr[\neg E] \\
&\le \frac{1}{640} + \frac{1}{640} \cdot \frac{1}{4} \le \frac{1}{512}.
\end{aligned}
\]

Notice that now, for any $m > 0$, we have $p^{l^*}_m \le \left(1 - \frac{1}{213}\right)^m$. Thus, for some constant $c_4$, setting
$k = l^* + c_4 d \log(sd)$ and applying Lemma 12 gives us
$p^0_k \le \left(1 - \frac{1}{68s}\right)^{l^*} \left(1 - \frac{1}{213}\right)^{c_4 d \log(sd)} \le \frac{1}{4(sd)^{2d}}$. Thus we have

\[ L(s) \le L(s/2) + c_4 d \log(sd) \]

which gives us the desired result on solving the recurrence, i.e. $L(s) = O(d \log s \log sd)$. □
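For completeness, the last recurrence can be unrolled as follows. This is a sketch that assumes $s = 2^k$ is a power of 2 and uses the base case $L(2) = O(d \log d)$ supplied by Fact 3; the index $i$ is introduced here only for the summation.

```latex
L(s) \;\le\; L(2) + \sum_{i=0}^{k-2} c_4\, d \log\!\left(\frac{sd}{2^{i}}\right)
\;\le\; L(2) + (k-1)\, c_4\, d \log(sd)
\;\le\; O(d \log d) + c_4\, d \log s \,\log(sd)
\;=\; O(d \log s \log sd).
```

In contrast to the recurrence behind Theorem 4, the term $L(s/2)$ appears only once on the right-hand side, which is why the dependence on $s$ drops from linear to poly-logarithmic.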

4 A packing lemma for RPTree-Max 

In this section we prove a probabilistic packing lemma for RPTree-Max. A formal statement of 
the result follows : 

Theorem 13 (Main). Given any fixed ball $B(x, R) \subset \mathbb{R}^D$, with probability greater than $1/2$ (where
the randomization is over the construction of the RPTree-Max), the number of disjoint RPTree-Max
cells of radius greater than $r$ that intersect $B$ is at most $\left(\frac{R}{r}\right)^{O(d \log d \log(dR/r))}$.

Data structures such as BBD-trees give a bound of the form $O\left(\frac{R}{r}\right)^{D}$ which behaves like
$\left(\frac{R}{r}\right)^{O(1)}$ for fixed $D$. In comparison, our result behaves like $\left(\frac{R}{r}\right)^{O(\log(R/r))}$ for fixed $d$. We will prove the
result in two steps : first of all we will show that with high probability, the ball $B$ will be completely
inscribed in an RPTree-Max cell $C$ of radius no more than $O(Rd\sqrt{d}\log d)$. Thus the number of
disjoint cells of radius at least $r$ that intersect this ball is bounded by the number of descendants of
$C$ with this radius. To bound this number we then invoke Theorem 5 and conclude the proof.

4.1 An effective aspect ratio bound for RPTree-Max cells 

In this section we prove an upper bound on the radius of the smallest RPTree-Max cell that
completely contains a given ball $B$ of radius $R$. Note that this effectively bounds the aspect ratio
of this cell. Consider any cell $C$ of radius $\Delta$ that contains $B$. We proceed with the proof by first
showing that the probability that $B$ will be split before it lands up in a cell of radius $\Delta/2$ is at most
a quantity inversely proportional to $\Delta$. Note that we are not interested in all descendants of $C$ - only
the ones that contain $B$. That is why we argue differently here. We consider balls of radius
$\Delta/512\sqrt{d}$ surrounding $B$ at a distance of $\Delta/2$ (see Figure 2). These balls are made to cover the
annulus centered at $B$ of mean radius $\Delta/2$ and thickness $\Delta/512\sqrt{d}$ - clearly $d^{O(d)}$ balls suffice.
Without loss of generality assume that the centers of all these balls lie in $C$.

Notice that if $B$ gets separated from all these balls without getting split in the process then it will
lie in a cell of radius $< \Delta/2$. Fix a $B_i$ and call a random split of the RPTree-Max useful if
it separates $B$ from $B_i$ and useless if it splits $B$. Using a proof technique similar to that used in
Lemma 8 we can show that the probability of a useful split is at least $\frac{1}{192}$, whereas Lemma 6 tells us
that the probability of a useless split is at most $\frac{3R\sqrt{d}}{\Delta}$.

Lemma 14. There exists a constant $c_5$ such that the probability of a ball of radius $R$ in a cell of
radius $\Delta$ getting split before it lands up in a cell of radius $\Delta/2$ is at most $\frac{c_5 R d\sqrt{d}\log d}{\Delta}$.

Proof. The only bad event for us is the one in which $B$ gets split before it gets separated from
all the $B_j$'s. Call this event $E$. Also, denote by $E[i]$ the bad event that $B$ gets split for the first
time in the $i$-th split and the preceding $i - 1$ splits are incapable of separating $B$ from all the $B_j$'s.
Thus $\Pr[E] \le \sum_{i > 0} \Pr[E[i]]$. Since any given split is a useful split (i.e. separates $B$ from a fixed $B_j$)
with probability $\ge \frac{1}{192}$, the probability that $i - 1$ splits will fail to separate all $B_j$'s from $B$
(while not splitting $B$) is at most $\min\left\{1, \left(1 - \frac{1}{192}\right)^{i-1} \cdot N\right\}$ where $N = d^{O(d)}$ is the number of
balls $B_j$. Since all splits in an RPTree-Max are independent of each other, we have
$\Pr[E[i]] \le \min\left\{1, \left(1 - \frac{1}{192}\right)^{i-1} \cdot N\right\} \cdot \frac{3R\sqrt{d}}{\Delta}$. Let $k$ be such that $\left(1 - \frac{1}{192}\right)^{k-1} \le \frac{1}{4N}$. Clearly
$k = O(d \log d)$ suffices. Thus we have

\[
\Pr[E] \le \frac{3R\sqrt{d}}{\Delta} \left( \sum_{i=1}^{k} 1 + \sum_{i > k} \left(1 - \frac{1}{192}\right)^{i-1} N \right)
\]

which gives us $\Pr[E] = O\left(\frac{Rd\sqrt{d}\log d}{\Delta}\right)$ since the second summation is just a constant. □

Figure 2: Balls $B_i$ are of radius $\Delta/512\sqrt{d}$ and their centers are $\Delta/2$ far from the center of $B$.
(Panel labels: useful split, useless split.)

We now state our result on the "effective" bound on aspect ratios of RPTree-Max cells.

Theorem 15. There exists a constant $c_6$ such that with probability $> 1 - 1/4$, a given (fixed) ball
$B$ of radius $R$ will be completely inscribed in an RPTree-Max cell $C$ of radius no more than
$c_6 \cdot Rd\sqrt{d}\log d$.

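Proof. Let $\Delta^* = 4c_5 Rd\sqrt{d}\log d$ and $\Delta_{\max}$ be the radius of the entire dataset. Denote by $F[i]$ the
event that $B$ ends up unsplit in a cell of radius $\frac{\Delta_{\max}}{2^i}$. The event we are interested in is $F[m]$ for
$m = \log\frac{\Delta_{\max}}{\Delta^*}$. Note that $\Pr[\neg F[m] \mid F[m-1]]$ is exactly $\Pr[E]$ where $E$ is the event described in
Lemma 14 for an appropriately set value of the radius $\Delta$. Also $\Pr[F[m] \mid \neg F[m-1]] = 0$. Thus we have

\[
\begin{aligned}
\Pr[F[m]] &= \prod_{i=0}^{m-1} \Pr[F[i+1] \mid F[i]]
\ge \prod_{i=0}^{m-1} \left(1 - \frac{c_5 Rd\sqrt{d}\log d}{\Delta_{\max}/2^i}\right)
\ge 1 - \sum_{i=0}^{m-1} \frac{c_5 Rd\sqrt{d}\log d}{\Delta_{\max}/2^i} \\
&= 1 - \sum_{i=0}^{m-1} \frac{c_5 Rd\sqrt{d}\log d}{2^{m-i}\,\Delta^*}
= 1 - \frac{1}{4} \sum_{i=0}^{m-1} \frac{1}{2^{m-i}}
\ge 1 - \frac{1}{4}.
\end{aligned}
\]

Setting $c_6 = 4c_5$ gives us the desired result. □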

Proof (of Theorem 13). Given a ball $B$ of radius $R$, Theorem 15 shows that with probability at
least $3/4$, $B$ will lie in a cell $C$ of radius at most $R' = O(Rd\sqrt{d}\log d)$. Hence all cells of
radius at least $r$ that intersect this ball must be either descendants or ancestors of $C$. Since we want
an upper bound on the largest number of such disjoint cells, it suffices to count the number of
descendants of $C$ of radius no less than $r$. We know from Theorem 5 that with probability at least
$3/4$, in $c_3 \log(R'/r)\, d \log(dR'/r)$ levels the radius of all cells must go below $r$. The result follows by
observing that the RPTree-Max is a binary tree and hence the number of such descendants can be at most
$2^{c_3 \log(R'/r)\, d \log(dR'/r)}$. The success probability is at least $(3/4)^2 > 1/2$. □

Figure 3: Locally, almost all the energy of the data is concentrated in the tangent plane.

5 Local covariance dimension of a smooth manifold 

The second variant of RPTree, namely RPTree-Mean, adapts to the local covariance dimension
(see definition below) of data. We do not go into the details of the guarantees presented in [1] due
to lack of space. Informally, the guarantee is of the following kind : given data that has small local
covariance dimension, in expectation, a data point in a cell of radius $r$ in the RPTree-Mean will
be contained in a cell of radius $c_7 \cdot r$ in the next level for some constant $c_7 < 1$. The randomization
is over the construction of RPTree-Mean as well as choice of the data point. This gives per-level
improvement, albeit in expectation, whereas RPTree-Max gives improvement in the worst case but
after a certain number of levels.

We will prove that a $d$-dimensional Riemannian submanifold $\mathcal{M}$ of $\mathbb{R}^D$ has bounded local
covariance dimension, thus proving that RPTree-Mean adapts to manifold dimension as well.

Definition 16. A set $S \subset \mathbb{R}^D$ has local covariance dimension $(d, \epsilon, r)$ if there exists an isometry
$M$ of $\mathbb{R}^D$ under which the set $S$ when restricted to any ball of radius $r$ has a covariance matrix for
which some $d$ diagonal elements contribute a $(1 - \epsilon)$ fraction of its trace.
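To make the definition concrete, the sketch below computes, for the points of a data set that fall inside one given ball, the smallest $\epsilon$ for which the $d$ largest diagonal entries of the covariance matrix carry a $(1 - \epsilon)$ fraction of the trace. It is an illustration under simplifying assumptions: the isometry is taken to be the identity (so the coordinate axes are used as they are), only a single ball is examined rather than every ball of radius $r$, and the function names are hypothetical.

```python
import math

def covariance_diagonal(points):
    """Per-coordinate variances, i.e. the diagonal of the covariance matrix."""
    n, D = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(D)]
    return [sum((p[j] - mean[j]) ** 2 for p in points) / n for j in range(D)]

def residual_fraction(points, center, r, d):
    """Smallest eps such that the d largest diagonal entries carry a
    (1 - eps) fraction of the trace, for the points inside B(center, r)."""
    local = [p for p in points if math.dist(p, center) <= r]
    diag = sorted(covariance_diagonal(local), reverse=True)
    return 1.0 - sum(diag[:d]) / sum(diag)

# Points on a short arc of the unit circle, embedded in R^3.
arc = [(math.cos(t / 1000), math.sin(t / 1000), 0.0) for t in range(-100, 101)]
eps = residual_fraction(arc, (1.0, 0.0, 0.0), 0.2, d=1)
print(eps)
```

For this arc the tangent direction at $(1, 0, 0)$ is the second coordinate axis, so a single diagonal entry already accounts for almost all of the trace, which is the behaviour the next theorem establishes for smooth manifolds in general.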

This is a more general definition than the one presented in [1] which expects the top $d$ eigenvalues
of the covariance matrix to account for a $(1 - \epsilon)$ fraction of its trace. However, all that [1] requires
for the guarantees of RPTree-Mean to hold is that there exist $d$ orthonormal directions such that
a $(1 - \epsilon)$ fraction of the energy of the dataset, i.e. $\sum_{x \in S} \|x - \text{mean}(S)\|^2$, is contained in those $d$
dimensions. This is trivially true when $\mathcal{M}$ is a $d$-dimensional affine set. However we also expect
that for small neighborhoods on smooth manifolds, most of the energy would be concentrated in the
tangent plane at a point in that neighborhood (see Figure 3). Indeed, we can show the following :

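Theorem 17 (Main). Given a data set $S \subset \mathcal{M}$ where $\mathcal{M}$ is a $d$-dimensional Riemannian manifold
with condition number $\tau$, then for any $\epsilon \le \frac{1}{4}$, $S$ has local covariance dimension $\left(d, \epsilon, \frac{\sqrt{\epsilon}\tau}{3}\right)$.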

For manifolds, the local curvature decides how small a neighborhood one should take in order to
expect a sense of "flatness" in the non-linear surface. This is quantified using the Condition Number
$\tau$ of $\mathcal{M}$ (introduced in [17]) which restricts the amount by which the manifold can curve locally.
The condition number is related to more prevalent notions of local curvature such as the second
fundamental form [18] in that the inverse of the condition number upper bounds the norm of the
second fundamental form [17]. Informally, if we restrict ourselves to regions of the manifold of
radius $\tau$ or less, then we get the requisite flatness properties. [17] formalizes this as follows. For
any hyperplane $T \subset \mathbb{R}^D$ and a vector $v \in \mathbb{R}^D$, let $v_{\parallel}(T)$ denote the projection of $v$ onto $T$.

Fact 18 (Implicit in Lemma 5.3 of [17]). Suppose $\mathcal{M}$ is a Riemannian manifold with condition
number $\tau$. For any $p \in \mathcal{M}$ and $r \le \sqrt{\epsilon}\tau$, $\epsilon \le \frac{1}{4}$, let $\mathcal{M}' = B(p, r) \cap \mathcal{M}$. Let $T = T_p(\mathcal{M})$ be the
tangent space at $p$. Then for any $x, y \in \mathcal{M}'$, $\|x_{\parallel}(T) - y_{\parallel}(T)\|^2 \ge (1 - \epsilon)\|x - y\|^2$.

This already seems to give us what we want - a large fraction of the length between any two points
on the manifold lies in the tangent plane - i.e. in $d$ dimensions. However in our case we have
to show that for some $d$-dimensional plane $P$, $\sum_{x \in S} \|(x - \mu)_{\parallel}(P)\|^2 \ge (1 - \epsilon) \sum_{x \in S} \|x - \mu\|^2$
where $\mu = \text{mean}(S)$. The problem is that we cannot apply Fact 18 since there is no guarantee that the
mean will lie on the manifold itself. However it turns out that certain points on the manifold can act
as "proxies" for the mean and provide a workaround to the problem.

Proof (of Theorem 17). Suppose $\mathcal{M}' = B(x_0, r) \cap \mathcal{M}$ for $r = \frac{\sqrt{\epsilon}\tau}{3}$ and we are given data points
$S = \{x_1, \ldots, x_n\} \subset \mathcal{M}'$. Let $q = \arg\min_{x \in \mathcal{M}} \|\mu - x\|$ be the closest point on the manifold to the mean.
The smoothness properties of $\mathcal{M}$ tell us that the vector $(\mu - q)$ is perpendicular to $T_q(\mathcal{M})$, the
$d$-dimensional tangent space at $q$ (in fact any point $q$ at which the function $g : x \in \mathcal{M} \mapsto \|x - \mu\|$
attains a local extremum would also have the same property). This has interesting consequences - let
$f$ be the projection map onto $T_q(\mathcal{M})$, i.e. $f(v) = v_{\parallel}(T_q(\mathcal{M}))$.

Then $f(\mu - q) = 0$ since $(\mu - q) \perp T_q(\mathcal{M})$. This implies that for any vector $v \in \mathbb{R}^D$,
$f(v - \mu) = f(v - q) + f(q - \mu) = f(v - q) = f(v) - f(q)$ since $f$ is a linear map. We now note that
$\min_i \|\mu - x_i\| \le r$. If this were not true then we would have $\sum_i \|\mu - x_i\|^2 > nr^2$ whereas we know
that $\sum_i \|\mu - x_i\|^2 \le \sum_i \|x_0 - x_i\|^2 \le nr^2$ since for any random variable $X \in \mathbb{R}^D$ and fixed $v \in \mathbb{R}^D$,
we have $\mathbb{E}[\|X - v\|^2] \ge \mathbb{E}[\|X - \mathbb{E}[X]\|^2]$. Since $\|\mu - x_i\| \le r$ for some $x_i \in \mathcal{M}$, we know, by
definition of $q$, that $\|\mu - q\| \le r$ as well.

We also have $\|\mu - x_0\| \le r$ (since the convex hull of the points is contained in the ball $B(x_0, r)$ and the
mean, being a convex combination of the points, is contained in the hull) and $\|x_i - x_0\| \le r$ for all
points $x_i$. Hence we have for any point $x_i$, $\|x_i - q\| \le \|x_i - x_0\| + \|x_0 - \mu\| + \|\mu - q\| \le 3r$ and
conclude that $S \subset B(q, 3r) \cap \mathcal{M} = B(q, \sqrt{\epsilon}\tau) \cap \mathcal{M}$ which means we can apply Fact 18 between
the vectors $x_i$ and $q$.

Let $T = T_q(\mathcal{M})$ and $q$ as chosen above. We have

\[
\begin{aligned}
\sum_{x \in S} \|(x - \mu)_{\parallel}(T)\|^2 &= \sum_{x \in S} \|f(x - \mu)\|^2 = \sum_{x \in S} \|f(x - q)\|^2 = \sum_{x \in S} \|f(x) - f(q)\|^2 \\
&\ge (1 - \epsilon) \sum_{x \in S} \|x - q\|^2 \ge (1 - \epsilon) \sum_{x \in S} \|x - \mu\|^2
\end{aligned}
\]

where the last inequality again uses the fact that for a random variable $X \in \mathbb{R}^D$ and fixed $v \in \mathbb{R}^D$,
$\mathbb{E}[\|X - v\|^2] \ge \mathbb{E}[\|X - \mathbb{E}[X]\|^2]$. □
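The inequality $\mathbb{E}[\|X - v\|^2] \ge \mathbb{E}[\|X - \mathbb{E}[X]\|^2]$ used twice in the proof follows from a standard decomposition, spelled out below as a short worked derivation. In the proof, $X$ is uniform over the data points of $S$, so that $\mathbb{E}[X] = \mu$, and $v$ is first $x_0$ and then $q$.

```latex
\mathbb{E}\,\|X - v\|^2
= \mathbb{E}\,\|X - \mu\|^2 + 2\,\mathbb{E}\,\langle X - \mu,\; \mu - v\rangle + \|\mu - v\|^2
= \mathbb{E}\,\|X - \mu\|^2 + \|\mu - v\|^2
\;\ge\; \mathbb{E}\,\|X - \mu\|^2,
\qquad \mu = \mathbb{E}[X].
```

The cross term vanishes because $\mathbb{E}[X - \mu] = 0$ and $\mu - v$ is a fixed vector.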

6 Conclusion 

In this paper we considered the two random projection trees proposed in [1]. For the RPTree-Max
data structure, we provided an improved bound (Theorem 5) on the number of levels required
to decrease the size of the tree cells by any factor $s \ge 2$. However the bound we proved is
poly-logarithmic in $s$. It would be nice if this can be brought down to logarithmic since it would directly
improve the packing lemma (Theorem 13) as well. More specifically the packing bound would
become $\left(\frac{R}{r}\right)^{O(1)}$ instead of $\left(\frac{R}{r}\right)^{O(\log(R/r))}$ for fixed $d$.

As far as dependence on $d$ is concerned, there is room for improvement in the packing lemma. We
have shown that the smallest cell in the RPTree-Max that completely contains a fixed ball $B$ of
radius $R$ has an aspect ratio no more than $O(d\sqrt{d}\log d)$ since it has a ball of radius $R$ inscribed in
it and can be circumscribed by a ball of radius no more than $O(Rd\sqrt{d}\log d)$. Any improvement in
the aspect ratio of the smallest cell that contains a given ball will also directly improve the packing
lemma.

Moving on to our results for the RPTree-Mean, we demonstrated that it adapts to manifold
dimension as well. However the constants involved in our guarantee are pessimistic. For instance,
the radius parameter in the local covariance dimension is given as $\frac{\sqrt{\epsilon}\tau}{3}$ - this can be improved to
$\frac{\sqrt{\epsilon}\tau}{2}$ if one can show that there will always exist a point $q \in B(x_0, r) \cap \mathcal{M}$ at which the function
$g : x \in \mathcal{M} \mapsto \|x - \mu\|$ attains a local extremum.

We conclude with a word on the applications of our results. As we already mentioned, packing 
lemmas and size reduction guarantees for arbitrary factors are typically used in applications for 
nearest neighbor searching and clustering. However, these applications (viz. [12], [15]) also require
that the tree have bounded depth. The RPTree-Max is a pure space partitioning data structure that
can be coerced by an adversarial placement of points into being a primarily left-deep or right-deep
tree having depth $\Omega(n)$ where $n$ is the number of data points.

Existing data structures such as BBD Trees remedy this by alternating space partitioning splits with 
data partitioning splits. Thus every alternate split is forced to send at most a constant fraction of 
the points into any of the children thus ensuring a depth that is logarithmic in the number of data 
points. [7] also uses a similar technique to bound the depth of the version of RPTree-Max used 
in that paper. However it remains to be seen if the same trick can be used to bound the depth of 
RPTree-Max while maintaining the packing guarantees because although such "space partitioning"
splits do not seem to hinder Theorem 5, they do hinder Theorem 13 (more specifically they
hinder Lemma 14).

We leave open the question of a possible augmentation of the RPTree-Max structure, or a better 
analysis, that can simultaneously give the following guarantees : 

1. Bounded Depth : depth of the tree should be $o(n)$, preferably $(\log n)^{O(1)}$

2. Packing Guarantee : of the form $\left(\frac{R}{r}\right)^{(d \log(R/r))^{O(1)}}$

3. Space Partitioning Guarantee : assured size reduction by factor $s$ in $(d \log s)^{O(1)}$ levels